Learning Low Rank Matrices from O(n) Entries
Abstract: How many random entries of an $n \times n\alpha$, rank
r matrix are necessary to reconstruct the matrix within an
accuracy $\delta$? We address this question in the case of a random
matrix with bounded rank, whereby the observed entries are
chosen uniformly at random. We prove that, for any $\delta > 0$,
$C(r, \delta)\, n$ observations are sufficient.

Finally we discuss the question of reconstructing the matrix 
efficiently, and demonstrate through extensive simulations that 
this task can be accomplished in n Poly(log n) operations, for
small rank. 

I. Introduction and main results 
A. Problem definition 

Let M be an $n \times m$ matrix of rank (at most) r and assume
that $n\epsilon$ uniformly random entries of M are revealed. Does
this knowledge allow one to approximately reconstruct M?

The answer is negative unless the matrix has some specific 
structure. In this paper we assume that M is a random rank-r 
matrix, i.e. $M = U \cdot V$ where U is an $n \times r$ matrix with iid
entries and V an independent $r \times m$ matrix with iid entries.
The distributions of the entries of U and V are denoted,
respectively, as $p_0$ and $q_0$.

The metric we shall consider is the root mean square error
(RMSE). If $\{M_{i,a}\}$ are the entries of $M$, and $\hat M$ is its estimate
based on the observed entries, we have

$$D(M, \hat M) = \Big( \frac{1}{nm} \sum_{i,a} |M_{i,a} - \hat M_{i,a}|^2 \Big)^{1/2} . \tag{1}$$

Notice that this coincides, up to a factor, with the distance
induced by the Frobenius norm: $D(M, \hat M) = \|M - \hat M\|_F / \sqrt{nm}$.
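The definition of the distortion can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `rmse_distortion` and the test matrices are illustrative choices, not part of the paper.

```python
import numpy as np

def rmse_distortion(M, M_hat):
    """Root mean square error between M and an estimate M_hat (Eq. (1))."""
    M = np.asarray(M, dtype=float)
    M_hat = np.asarray(M_hat, dtype=float)
    n, m = M.shape
    return np.sqrt(np.sum((M - M_hat) ** 2) / (n * m))

# Hypothetical example: a random rank-3 matrix and an estimate that is
# off by exactly 0.1 in every entry.
rng = np.random.default_rng(0)
n, m, r = 50, 80, 3
U = rng.uniform(-1.0, 1.0, size=(n, r))
V = rng.uniform(-1.0, 1.0, size=(r, m))
M = U @ V
M_hat = M + 0.1

d = rmse_distortion(M, M_hat)
# Same quantity written through the Frobenius norm.
frob = np.linalg.norm(M - M_hat, "fro") / np.sqrt(n * m)
```

Both expressions agree up to floating point error, which is the "up to a factor" relation with the Frobenius norm noted above.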

In the following we shall denote by $R \ni i, j, k, \dots$ the set
of rows of M and by $C \ni a, b, c, \dots$ its set of columns. The
subset of revealed entries will be denoted by $E \subseteq R \times C$.

B. Motivation and related work 

Low rank matrices have been proposed as statistical 
models to describe a number of complex data sources. 
For instance, the matrix of empirical correlations among 
stock prices in a market is approximately low rank if price 
fluctuations are driven by a few underlying mechanisms [1]. 
A completely different application is provided by the matrix 
of square distances among n sensors in 3 dimensions, which
has rank r = 5 [2]. 

Low rank matrices have been proposed as a model for 
collaborative filtering data. As a concrete example we shall 
focus here on the Netflix Challenge dataset [3]. This dataset
concerns a set C of approximately $5 \cdot 10^5$ customers and
a set R of $2 \cdot 10^4$ movies. For about $10^8$ customer-movie pairs
$(i, a) \in E$, the corresponding rating (an integer between 1
and 5) is provided. The challenge consists in predicting the
ratings of $10^6$ non-revealed customer-movie pairs within a
root mean square error smaller than 0.8563.

One possible approach consists in considering the 
customer-movie matrix M (or a rescaled version of it) and 
assuming that it has low rank to predict the requested entries. 
Indeed, a simple coordinate descent algorithm that minimizes 
the energy function 



$$\sum_{(i,a) \in E} \big(M_{i,a} - (UV)_{i,a}\big)^2 + \lambda \|U\|_F^2 + \lambda \|V\|_F^2 \tag{2}$$

provides good predictions (within the Netflix competition, it
was used by Simon Funk).
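To make the energy function concrete, here is a minimal sketch of one possible descent scheme, assuming NumPy. It performs exact block updates (one row of U or one column of V at a time, each a small ridge regression). This is an illustrative variant chosen for brevity; it is not claimed to be the procedure used in the Netflix competition or in the experiments of this paper. The names `energy`, `block_coordinate_descent`, `lam` and `sweeps` are hypothetical.

```python
import numpy as np

def energy(M, mask, U, V, lam):
    """Regularized squared error over revealed entries, as in Eq. (2)."""
    resid = (M - U @ V) * mask
    return np.sum(resid ** 2) + lam * np.sum(U ** 2) + lam * np.sum(V ** 2)

def block_coordinate_descent(M, mask, r, lam=0.1, sweeps=20, seed=0):
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.normal(scale=0.1, size=(n, r))
    V = rng.normal(scale=0.1, size=(r, m))
    history = [energy(M, mask, U, V, lam)]
    for _ in range(sweeps):
        for i in range(n):  # exact minimization over row i of U
            cols = np.flatnonzero(mask[i])
            A = V[:, cols] @ V[:, cols].T + lam * np.eye(r)
            U[i] = np.linalg.solve(A, V[:, cols] @ M[i, cols])
        for a in range(m):  # exact minimization over column a of V
            rows = np.flatnonzero(mask[:, a])
            A = U[rows].T @ U[rows] + lam * np.eye(r)
            V[:, a] = np.linalg.solve(A, U[rows].T @ M[rows, a])
        history.append(energy(M, mask, U, V, lam))
    return U, V, history

# Hypothetical usage on a random rank-2 matrix with 30% revealed entries.
rng = np.random.default_rng(1)
n, m, r = 60, 60, 2
M = rng.uniform(-1, 1, (n, r)) @ rng.uniform(-1, 1, (r, m))
mask = rng.random((n, m)) < 0.3
U, V, history = block_coordinate_descent(M, mask, r)
```

Each block update minimizes the energy exactly over that block, so the recorded energies never increase. As the next paragraph of the text points out, this says nothing about convergence to the original matrix.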

In general, the matrix completion problem is not convex, 
and the descent algorithm is not guaranteed to converge to 
the original matrix M even if this is the unique rank r ma- 
trix consistent with the observations. A possible alternative 
consists in relaxing the rank constraint, by looking instead 
for a matrix M of minimal nuclear norm (recall that the 
nuclear norm of M is the sum of the absolute values of 
its singular values). The problem then becomes convex and 
indeed reducible to semidefinite programming. In [4] it was 
shown that this relaxation indeed recovers the original low 
rank matrix M, given that a sufficient number of random 
linear combinations of its entries are revealed. 

The case in which a random subset of the entries is 
revealed (which is relevant for collaborative filtering) was 
treated in [5]. This paper proves that the convex relaxation is
tight with high probability^1 if $\epsilon > C r n^{1/5} \log n$. In particular
this implies two statements: (i) For $\epsilon > C r n^{1/5} \log n$, $n\epsilon$
random entries uniquely determine a random rank-r matrix.
(ii) This matrix is the unique minimum of a semidefinite
program.

C. Main results 

The results briefly reviewed above leave open several key 
issues: 

1. Why is it necessary to observe $O(n^{6/5})$ entries to
reconstruct a rank-r matrix, which has $\Theta(n)$ degrees of
freedom?

2. As the Netflix challenge shows, it is neither realistic
nor necessary to reconstruct M exactly. What is the
trade-off between RMSE distortion and number of
observations?

3. In general, semidefinite programming has $O(n^6)$ complexity
[6]. This is affordable up to $n \approx 10^2$, but
way beyond current capabilities when $n \approx 10^5$ as in
modern datasets.

In this paper we address the first two points and show
that O(n) observations are sufficient to reconstruct a low
rank matrix within any positive distortion.

^1 Strictly speaking, the matrix model treated in [5] is slightly different
from the one considered here. However it should not be hard to prove that
the two models are asymptotically equivalent for large n.

Theorem I.1. Let $M = U \cdot V$ be a random rank-r matrix
with n rows and $n\alpha$ columns and assume the distributions of
$U_{i,k}$ and $V_{k,a}$ to have support in $[-1, 1]$. Let E be a random
subset of $n\epsilon$ entries in $R \times C$. Then, with high probability, any
rank-r matrix $\hat M$ such that $|\hat M_{i,a} - M_{i,a}| \le \Delta$ for all $(i, a) \in E$,
and with factors $\hat U_{i,k}, \hat V_{k,a} \in [-1, 1]$, also satisfies

$$D(M, \hat M) \le \Delta + \frac{2r}{\sqrt{\bar\epsilon}} \log(10\bar\epsilon) , \tag{3}$$

where $\bar\epsilon = \epsilon/((1 + \alpha) r)$.

Notice that the term $\Delta$ in the above inequality is unavoidable.
Since we are looking for matrices that match the
observed entries only within precision $\Delta$, we cannot hope for
an RMSE smaller than $\Delta$. In the second term, the factor 2r
corresponds to the maximal distance between matrix entries
in the present model, while the $\bar\epsilon$-dependent factor tends
to 0 as $\bar\epsilon \to \infty$. Notice that $\bar\epsilon$ is exactly the number of
observations per degree of freedom.

The proof of this statement is given in Section III, which
also provides a much more accurate upper bound. The latter
is, however, not straightforward to evaluate. While it is clear
that small RMSE cannot be achieved with fewer than O(n)
observed matrix elements, Section IV proves a quantitative
lower bound of this form.

In Section V we address the question of efficient reconstruction
and demonstrate that O(n log n) operations are
sufficient to reconstruct random low rank matrices with
rank $r \le 4$ from O(n) entries. Indeed such performance
is achieved by a straightforward stochastic local search
algorithm that we refer to as WalkRank or by a coordinate
descent algorithm. A formal analysis of these algorithms will
be presented in a future publication. Finally, in Section VI
we use these results to compare random low rank matrices
and the Netflix dataset.

Before dwelling on the intricacies of the full problem, 
the next Section discusses a particularly simple but perhaps 
instructive case: rank r = 1. 

II. A WARMUP EXAMPLE 

If M has rank 1, most of the questions listed above 
have a simple answer with a suggestive graph-theoretical 
interpretation. 

Assume that you know 3 entries of the matrix M that
belong to the same $2 \times 2$ minor. Explicitly, for two row
indices $i, j \in R$ and two column indices $a, b \in C$, the
entries $M_{i,a}$, $M_{j,a}$, $M_{i,b}$ are known. Unless $M_{i,a} = 0$, the
fourth entry of the same minor is then uniquely determined:
$M_{j,b} = M_{j,a} M_{i,b} / M_{i,a}$. The case $M_{i,a} = 0$ can be treated
separately but, for the sake of simplicity, we shall assume
that the distributions $p_0$, $q_0$ do not have mass on 0.

This observation suggests a simple matrix completion
algorithm: Recursively look for a $2 \times 2$ minor with a unique
unknown entry and complete it according to the rule
$M_{j,b} = M_{j,a} M_{i,b} / M_{i,a}$. As anticipated above, this algorithm has
a nice graph-theoretic interpretation. Consider the bipartite
graph G = (R, C, E) with vertices corresponding to the rows
and columns of M and edges for the observed entries. If
a $2 \times 2$ minor has a unique unknown entry, it means that
the corresponding vertices $j \in R$, $b \in C$ are connected by
a length-3 path in G. Hence the algorithm recursively adds
edges to G connecting distance-3 vertices.
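The end result of the recursive completion can be sketched as follows, assuming NumPy and nonzero entries. Instead of literally adding one edge at a time, this sketch uses an equivalent shortcut: it explores each connected component of G by breadth-first search and propagates factor values along observed edges. The function name `complete_rank1` and the usage data are hypothetical.

```python
from collections import deque
import numpy as np

def complete_rank1(M_obs, mask):
    """Fill in a rank-1 matrix on each connected component of G.

    M_obs is trusted only where mask is True (entries assumed nonzero).
    Returns (M_hat, known), where known marks the determined entries.
    """
    n, m = mask.shape
    u, v = np.zeros(n), np.zeros(m)
    row_comp = -np.ones(n, dtype=int)
    col_comp = -np.ones(m, dtype=int)
    comp = 0
    for root in range(n):
        if row_comp[root] >= 0:
            continue
        u[root] = 1.0  # arbitrary scale; it cancels in u_i * v_a
        row_comp[root] = comp
        queue = deque([("row", root)])
        while queue:
            kind, x = queue.popleft()
            if kind == "row":
                for a in np.flatnonzero(mask[x]):
                    if col_comp[a] < 0:
                        col_comp[a] = comp
                        v[a] = M_obs[x, a] / u[x]
                        queue.append(("col", a))
            else:
                for j in np.flatnonzero(mask[:, x]):
                    if row_comp[j] < 0:
                        row_comp[j] = comp
                        u[j] = M_obs[j, x] / v[x]
                        queue.append(("row", j))
        comp += 1
    # An entry is determined when its row and column share a component.
    known = row_comp[:, None] == col_comp[None, :]
    M_hat = np.where(known, np.outer(u, v), 0.0)
    return M_hat, known

# Hypothetical usage: factors uniform in {+1, -1}, about 10% revealed.
rng = np.random.default_rng(0)
n, m = 40, 40
U0 = rng.choice([-1.0, 1.0], size=n)
V0 = rng.choice([-1.0, 1.0], size=m)
M = np.outer(U0, V0)
mask = rng.random((n, m)) < 0.1
M_hat, known = complete_rank1(M, mask)
```

Entries whose row and column lie in different components are left at 0 here, which matches the prediction used in the proof of Proposition II.1 for symmetric distributions.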
Fig. 1. Learning random rank-1 matrices. The continuous line
is the optimal distortion (achieved by the recursive completion
algorithm). Data points correspond to an O(n) complexity local
search algorithm.

After at most $O(n^2)$ operations the process described halts
on a graph that is a disjoint union of cliques, corresponding to
the connected components in G. Each edge corresponds to a
correctly predicted matrix entry. Clearly, in the large-n limit
only the components with $\Theta(n)$ vertices matter (as they have $\Theta(n^2)$
edges). It is a fundamental result in random graph theory that
there is no such component for $\epsilon < 1/\sqrt{\alpha}$. For $\epsilon > 1/\sqrt{\alpha}$,
there is one such component involving approximately $n\xi$ vertices in
R and $m\zeta$ vertices in C, where $(\xi, \zeta)$ is the unique positive
solution of

$$\xi = 1 - e^{-\epsilon\alpha\zeta} , \qquad \zeta = 1 - e^{-\epsilon\xi} . \tag{4}$$

This analysis implies the following result. 

Proposition II.1. Let $M = U \cdot V$ be a random rank-1 matrix,
and denote by $\xi(\epsilon)$, $\zeta(\epsilon)$ the largest solution of Eq. (4). Then
there exists an algorithm with $O(n^2)$ complexity achieving,
with high probability, RMSE

$$D(M, \hat M) = \sqrt{1 - \xi(\epsilon)\zeta(\epsilon)}\; \bar D + O\big(\sqrt{(\log n)/n}\big) , \tag{5}$$

where $\bar D = \sqrt{E(U_1^2)\, E(V_1^2)}$. Further, if the entries of U,
V have symmetric distributions, then no algorithm achieves
smaller distortion.

Proof. The mentioned distortion is achieved by the recursive
completion algorithm, whereby matrix elements corresponding
to vertex pairs in distinct components are predicted to
vanish. This is optimal if the matrix element distribution is
symmetric. Indeed the conditional matrix element distribution
remains symmetric even given the observations. □
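The distortion of Proposition II.1 can be evaluated by fixed-point iteration. The sketch below assumes that Eq. (4) reads $\xi = 1 - e^{-\epsilon\alpha\zeta}$, $\zeta = 1 - e^{-\epsilon\xi}$, which is the reading consistent with the threshold $\epsilon = 1/\sqrt{\alpha}$ stated in the text; the function names and the default `d_bar=1.0` (appropriate for factors uniform in {+1, -1}) are illustrative assumptions.

```python
import math

def giant_component_fractions(eps, alpha, iters=10_000, tol=1e-13):
    """Largest solution (xi, zeta) of the assumed form of Eq. (4)."""
    xi, zeta = 1.0, 1.0  # start from the top to reach the largest solution
    for _ in range(iters):
        xi_new = 1.0 - math.exp(-eps * alpha * zeta)
        zeta_new = 1.0 - math.exp(-eps * xi_new)
        if abs(xi_new - xi) + abs(zeta_new - zeta) < tol:
            return xi_new, zeta_new
        xi, zeta = xi_new, zeta_new
    return xi, zeta

def rank1_distortion(eps, alpha, d_bar=1.0):
    """Leading term of Eq. (5): sqrt(1 - xi * zeta) * d_bar."""
    xi, zeta = giant_component_fractions(eps, alpha)
    return d_bar * math.sqrt(max(0.0, 1.0 - xi * zeta))

xi_2, zeta_2 = giant_component_fractions(2.0, 1.0)
below_threshold = rank1_distortion(0.5, 1.0)
above_threshold = rank1_distortion(2.0, 1.0)
```

Below the threshold the iteration collapses to $\xi = \zeta = 0$ and the distortion stays at $\bar D$; above it, the distortion drops as the giant component grows.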

For massive datasets even $O(n^2)$ complexity is unaffordable.
Figure 1 compares the minimal distortion guaranteed
by Proposition II.1 with the performance of the WalkRank
algorithm described in Section V. Here the factors $U_i$, $V_a$
were chosen uniformly in {+1, -1}.

III. Upper Bound and Proof of Theorem I.1

In this section we prove the upper bound on distortion
stated in Theorem I.1. The proof proceeds in three steps. First
we consider the case in which the factor entries $U_{i,k}$,
$V_{k,a}$ are supported on a finite set, and prove a (tighter) upper
bound via a counting argument. Then we use a quantization
argument to generalize this bound to the continuous case.
Finally, we simplify our bound to get the pleasing expression
in Theorem I.1. Unfortunately this simplification entails a
worsening of the bound.

A. The discrete case 

We start by introducing some notation. Given
a row index $i \in R$, we let $u_i^0 = (U_{i,1}, \dots, U_{i,r})$ be the i-th
row of U. Analogously, for $a \in C$, let $v_a^0$ be the a-th column
of V. We then have

$$M_{i,a} = u_i^0 \cdot v_a^0 . \tag{6}$$

We also write $u_i^0 = (u^0_{i,1}, \dots, u^0_{i,r})$ and $v_a^0 = (v^0_{a,1}, \dots, v^0_{a,r})$
for the components of these vectors. These are assumed to be
iid with distributions $p_0$ (for u) and $q_0$ (for v) supported
on a finite set $A_N \subseteq \mathbb{R}$ with $|A_N| = N$ points. Typical
examples are $A_2 = \{-1, +1\}$ or $A_{2M+1} = \{-M\epsilon, -(M-1)\epsilon, \dots, (M-1)\epsilon, M\epsilon\}$.
Our basic counting estimate is
stated below.
stated below. 

Proposition III.1. Let $\Delta > 0$ and M be a random rank-r
matrix with factors supported in $A_N$. Then, with high
probability, any rank-r matrix $\hat M$ with factors supported in
$A_N$ that satisfies $|\hat M_{i,a} - M_{i,a}| \le \Delta$ for all $(i,a) \in E$ also
satisfies $D(M, \hat M) \le \bar\delta(\epsilon, \alpha, \Delta) + o_n(1)$, where

$$\bar\delta(\epsilon, \alpha, \Delta) = \sup_{p \in \mathcal{D}(p_0),\, q \in \mathcal{D}(q_0)} \{ d(p,q) : \phi_\Delta(p,q) \ge 0 \} . \tag{7}$$

Here the sup over p (over q) is taken over the space of
distributions $\mathcal{D}(p_0)$ (respectively $\mathcal{D}(q_0)$) over $(A_N)^r \times (A_N)^r$
such that $\sum_u p(u, u^0) = p_0(u^0)$ (respectively $\sum_v q(v, v^0) = q_0(v^0)$).
The functionals appearing in Eq. (7) are defined by

$$d(p,q) = \big\{ E_{p,q} |u \cdot v - u^0 \cdot v^0|^2 \big\}^{1/2} , \tag{8}$$

and

$$\phi_\Delta(p,q) = H(p) - H(p_0) + \alpha [H(q) - H(q_0)] + \epsilon\, E_{p_0,q_0} \log P_{p,q}\{ |u \cdot v - u^0 \cdot v^0| \le \Delta \mid u^0, v^0 \} . \tag{9}$$

Proof. Define $Z_G(\Delta, \delta)$ (G is the bipartite graph with edge
set E) as the number of matrices $\hat M$ of the form (6) such
that:

(1) $|\hat M_{i,a} - M_{i,a}| \le \Delta$ for all $(i,a) \in E$;

(2) $D(M, \hat M) > \delta$.

This can be written as

$$Z_G(\Delta, \delta) = \sum_{\{u_i, v_a\} \in \mathcal{C}(\delta)} \; \prod_{(i,a) \in E} I\big( |u_i \cdot v_a - u_i^0 \cdot v_a^0| \le \Delta \big) ,$$

where $\mathcal{C}(\delta)$ is the set of vectors that satisfy condition (2)
above. We further define the set of typical instances (M, E),
Typ($\gamma$), through the following conditions:

(a) Let $\theta_U(\cdot)$ be the type of factor U, namely $n\theta_U(u)$ is
the number of row indices $i \in R$ such that $u_i^0 = u$.
Then for $(M,E) \in$ Typ($\gamma$), we have $D(\theta_U \| p_0) \le \gamma$.

(b) Analogously, for the type of factor V we require
$D(\theta_V \| q_0) \le \gamma$.

(c) Finally, let $\theta_E(\cdot, \cdot)$ be the edge type, i.e. $n\epsilon\,\theta_E(u, v)$
is the number of edges $(i, a) \in E$ such that $u_i^0 = u$
and $v_a^0 = v$. We then require $D(\theta_E \| p_0 \cdot q_0) \le \gamma$ (where
$p_0 \cdot q_0$ is the product distribution on u, v).

By standard arguments [7] we have P{Typ($\gamma$)} $\to 1$ for any
positive $\gamma$ as $n \to \infty$. We then define

$$\tilde Z_G(\Delta, \delta) = Z_G(\Delta, \delta)\, I\big( (M,E) \in \mathrm{Typ}(\gamma) \big) . \tag{10}$$

According to Lemma III.2, the expectation of $\tilde Z_G(\Delta, \delta)$
vanishes as n tends to infinity for $\delta > \bar\delta(\epsilon, \alpha, \Delta)$. Since
P{Typ($\gamma$)} $\to 1$ and using Markov's inequality, this implies
that $\lim_{n \to \infty} P\{Z_G(\Delta, \delta) > 0\} = 0$. In conclusion, any matrix
$\hat M$ that satisfies $|\hat M_{i,a} - M_{i,a}| \le \Delta$ for all $(i, a) \in E$ has
distortion smaller than $\bar\delta(\epsilon, \alpha, \Delta)$ with high probability,
as n tends to infinity. □

Lemma III.2. For any $\delta > \bar\delta(\epsilon, \alpha, \Delta)$ there exists $\gamma > 0$
such that $\lim_{n \to \infty} E\{\tilde Z_G(\Delta, \delta)\} = 0$.

Proof. $\tilde Z_G(\Delta, \delta)$ is a random variable where the randomness
comes from the matrix elements $M_{i,a}$ and the choice of
the sampling set E. Since E is uniformly random, we can
take any realization of $M = U \cdot V$ from the typical set
according to iid $p_0$ and iid $q_0$. Given one such realization
of $U = (u_1^0, \dots, u_n^0)$ and $V = (v_1^0, \dots, v_m^0)$, go through
all the estimates $\hat M = \hat U \cdot \hat V$, where $\hat U = (u_1, \dots, u_n)$
and $\hat V = (v_1, \dots, v_m)$. Now group the assignments
$\hat U$ and $\hat V$ that have the same empirical distribution, and let
$p(u, u^0)$ and $q(v, v^0)$ denote the joint distributions. Then,
the number of different assignments with the same empirical
distribution (p, q) is $e^{n\{H(p) - H(p_0)\} + m\{H(q) - H(q_0)\}}$. For
each distribution pair (p, q) that satisfies condition (2) above,
we fix the factors $\hat U$ and $\hat V$ and compute the probability
that they satisfy condition (1). Denoting by $E'_{E,M}\{\cdots\} =
E_{E,M}\{\cdots I((E, M) \in \mathrm{Typ}(\gamma))\}$ the expectation restricted to
$(E, M) \in$ Typ($\gamma$), we have
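$$E'_{E,M}\{Z_G(\Delta, \delta)\} = E'_{E,M}\Big\{ \sum_{\{u_i, v_a\} \in \mathcal{C}(\delta)} \prod_{(i,a) \in E} I\big( |u_i \cdot v_a - u_i^0 \cdot v_a^0| \le \Delta \big) \Big\}$$

$$\le \sum_{\substack{p \in \mathcal{D}(p_0),\, q \in \mathcal{D}(q_0) \\ d(p,q) > \delta}} e^{nH(p|p_0) + mH(q|q_0)}\; E'_{E,M}\Big\{ \prod_{(i,a) \in E} I\big( |u_i \cdot v_a - u_i^0 \cdot v_a^0| \le \Delta \big) \Big\} ,$$

where $H(p|p_0) = H(p) - H(p_0)$ and $H(q|q_0) = H(q) - H(q_0)$.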



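To compute the expectation in the last inequality, we look at a
typical realization of E and partition it into subsets $\{E_{u^0, v^0}\}$,
for $(u^0, v^0) \in (A_N)^r \times (A_N)^r$, defined as follows: $(i, a) \in E$
is in $E_{u^0, v^0}$ if $u_i^0 = u^0$ and $v_a^0 = v^0$. By definition $|E_{u^0, v^0}| =
n\epsilon\,\theta_E(u^0, v^0)$. Further, $E_{u^0, v^0}$ is uniformly random given its
size. Within the typical set Typ($\gamma$), $\theta_E(u^0, v^0)$ is close to
$p_0(u^0) q_0(v^0)$. We thus get

$$E'_{E,M}\Big\{ \prod_{(i,a) \in E} I\big( |u_i \cdot v_a - u_i^0 \cdot v_a^0| \le \Delta \big) \Big\}
= \prod_{u^0, v^0} E'\Big\{ \prod_{(i,a) \in E_{u^0, v^0}} I\big( |u_i \cdot v_a - u^0 \cdot v^0| \le \Delta \big) \Big\}$$

$$= \prod_{u^0, v^0} P_{p,q}\big\{ |u \cdot v - u^0 \cdot v^0| \le \Delta \mid u^0, v^0 \big\}^{n\epsilon\,\theta_E(u^0, v^0)} .$$

Finally, we get

$$E'_{E,M}\{Z_G(\Delta, \delta)\} \le e^{n\kappa(\gamma)} \sum_{\substack{p \in \mathcal{D}(p_0),\, q \in \mathcal{D}(q_0) \\ d(p,q) > \delta}} e^{n\phi_\Delta(p,q)} , \tag{11}$$

where $\kappa(\gamma) \to 0$ as $\gamma \to 0$. For (p, q) that satisfies $d(p, q) >
\bar\delta(\epsilon, \alpha, \Delta)$, we know that $\phi_\Delta(p, q) < 0$ by definition. Hence,
for $\gamma$ small enough, $\delta > \bar\delta(\epsilon, \alpha, \Delta)$ is a sufficient condition for
$\lim_{n \to \infty} E'_{E,M}\{Z_G(\Delta, \delta)\} = 0$. □

B. General distributions via quantization

The tighter upper bound above can be generalized to the matrices
of Theorem I.1 via a quantization argument. In this section,
we are interested in recovering a continuous real-valued matrix
M from samples of its entries. First, we estimate it using
factors $\hat U_{i,k}$, $\hat V_{k,a}$ supported in the continuous alphabet.
Then, the distortion is bounded using the upper bound from
Section III-A via quantization.

Proposition III.3. Let $\Delta > 0$ and M be a random rank-r
matrix with factors supported in a continuous bounded
alphabet $A_c$. Let $A_\delta$ be a discrete quantized alphabet of $A_c$,
with maximum quantization error less than $\delta/2$. Let $\hat M$ be a
rank-r estimate with factors supported in $A_c$. Then, with
high probability, any such matrix $\hat M$ that satisfies $|\hat M_{i,a} - M_{i,a}| \le \Delta$
for all $(i,a) \in E$ also satisfies $D(M, \hat M) \le \bar\delta(\epsilon, \alpha, \Delta +
2\,\mathrm{err}(\delta)) + 2\,\mathrm{err}(\delta) + o_n(1)$, where $\bar\delta(\epsilon, \alpha, \Delta)$ is defined as
in Eq. (7) and $\mathrm{err}(\delta)$ is the quantization error, which only
depends on $\delta$.

Proof. Let $M^\delta$ be the quantized version of the original matrix
M, which is defined as follows. Define $u_i^\delta \in (A_\delta)^r$ and $v_a^\delta \in
(A_\delta)^r$ to be the quantized versions of $u_i$ and $v_a$ respectively,
where $u_i$ is the i-th row of U and $v_a$ is the a-th column of V.
Then, $M^\delta$ is defined as

$$M^\delta_{i,a} = u_i^\delta \cdot v_a^\delta .$$

Note that $M^\delta$ satisfies $|M_{i,a} - M^\delta_{i,a}| \le \mathrm{err}(\delta)$. Analogously,
define $\hat M^\delta$ to be the quantized version of the estimated matrix
$\hat M$. Then, $M^\delta$ and $\hat M^\delta$ satisfy $|M^\delta_{i,a} - \hat M^\delta_{i,a}| \le \Delta + 2\,\mathrm{err}(\delta)$
for all $(i, a) \in E$.

Let $\bar\delta(\epsilon, \alpha, \Delta)$ be the upper bound in Proposition III.1.
Then, the distortion is bounded with high probability by

$$D(M, \hat M) \le D(M, M^\delta) + D(M^\delta, \hat M^\delta) + D(\hat M^\delta, \hat M)
\le \bar\delta(\epsilon, \alpha, \Delta + 2\,\mathrm{err}(\delta)) + 2\,\mathrm{err}(\delta) . \tag{12}$$

Note that twice the quantization error is added to $\Delta$ since
now we only have $|M^\delta_{i,a} - \hat M^\delta_{i,a}| \le \Delta + 2\,\mathrm{err}(\delta)$ for all
$(i,a) \in E$. □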


C. Simplified bound 

The (tighter) upper bound in Proposition III.1 is not easily
computed. To get a bound that can be analyzed, we relax the
constraint $\phi_\Delta \ge 0$ and get a relaxed or simplified upper
bound on $\bar\delta(\epsilon, \alpha, \Delta)$. Furthermore, this simplified upper
bound is used to prove Theorem I.1.

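Proposition III.4. For all $\epsilon > 0$, $\alpha > 0$ and $\Delta > 0$, we
have

$$\bar\delta(\epsilon, \alpha, \Delta) \le \Big\{ \bar d^2 - (\bar d^2 - \Delta^2) \exp\Big( -\frac{1}{\epsilon} \big[ H(p|p_0) + \alpha H(q|q_0) \big] \Big) \Big\}^{1/2} ,$$

where $\bar\delta(\epsilon, \alpha, \Delta)$ is defined as in Proposition III.1,
$H(p|p_0) = \max_{p \in \mathcal{D}(p_0)} \{H(p)\} - H(p_0)$,
$H(q|q_0) = \max_{q \in \mathcal{D}(q_0)} \{H(q)\} - H(q_0)$,
and $\bar d = \max\{ |u \cdot v - u^0 \cdot v^0| \}$.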

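Proof. Define the upper bound $\tilde\delta(\epsilon, \alpha, \Delta)$ as

$$\tilde\delta(\epsilon, \alpha, \Delta) = \sup_{p \in \mathcal{D}(p_0),\, q \in \mathcal{D}(q_0)} \{ d(p,q) : \tilde\phi_\Delta(p,q) \ge 0 \} , \tag{13}$$

where $\mathcal{D}(p_0)$, $\mathcal{D}(q_0)$ and $d(p, q)$ are defined as in Eqs. (7) and (8). The
only difference is the relaxed constraint function $\tilde\phi_\Delta$, defined by

$$\tilde\phi_\Delta(p, q) = H(p|p_0) + \alpha H(q|q_0) + \epsilon \log\Big( \frac{\bar d^2 - d(p,q)^2}{\bar d^2 - \Delta^2} \Big) .$$

By Jensen's and Markov's inequalities, $\tilde\phi_\Delta(p, q)$ is larger
than $\phi_\Delta(p, q)$. This implies that the supremum in
the simplified upper bound is taken over a larger set
of distributions than in the tighter upper bound, hence
we have $\bar\delta(\epsilon, \alpha, \Delta) \le \tilde\delta(\epsilon, \alpha, \Delta)$. After some
computation, it is easy to show that

$$\tilde\delta(\epsilon, \alpha, \Delta) = \Big\{ \bar d^2 - (\bar d^2 - \Delta^2) \exp\Big( -\frac{1}{\epsilon} \big[ H(p|p_0) + \alpha H(q|q_0) \big] \Big) \Big\}^{1/2} ,$$

which concludes the proof. □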
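The two steps left implicit in the proof can be spelled out as follows. This is a sketch under the assumption that the relaxed constraint has the form $H(p|p_0) + \alpha H(q|q_0) + \epsilon \log\big((\bar d^2 - d(p,q)^2)/(\bar d^2 - \Delta^2)\big)$, with $X = u \cdot v - u^0 \cdot v^0$, $|X| \le \bar d$ and $\Delta < \bar d$; the symbol $\tilde\phi_\Delta$ for the relaxed constraint is a notational choice made here.

```latex
% Step 1 (Markov applied to the nonnegative variable \bar d^2 - X^2, then Jensen):
\begin{align*}
P\{|X| \le \Delta\}
  &= P\{\bar d^2 - X^2 \ge \bar d^2 - \Delta^2\}
   \le \frac{\bar d^2 - E[X^2]}{\bar d^2 - \Delta^2}, \\
E_{p_0,q_0} \log P_{p,q}\{|X| \le \Delta \mid u^0, v^0\}
  &\le \log E_{p_0,q_0} P_{p,q}\{|X| \le \Delta \mid u^0, v^0\}
   \le \log \frac{\bar d^2 - d(p,q)^2}{\bar d^2 - \Delta^2}.
\end{align*}
% Step 2 (solving the relaxed constraint for d(p,q)):
\begin{align*}
\tilde\phi_\Delta(p,q) \ge 0
  &\iff \log \frac{\bar d^2 - d(p,q)^2}{\bar d^2 - \Delta^2}
        \ge -\frac{1}{\epsilon}\big[H(p|p_0) + \alpha H(q|q_0)\big] \\
  &\iff d(p,q)^2 \le \bar d^2 - (\bar d^2 - \Delta^2)
        \exp\Big(-\frac{1}{\epsilon}\big[H(p|p_0) + \alpha H(q|q_0)\big]\Big).
\end{align*}
```

Taking the square root of the last right-hand side gives the expression stated in Proposition III.4.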



This simplified upper bound can be generalized, in the 
same manner, to the continuous support case. The following 
example illustrates this generalization and introduces bounds
necessary in the proof of Theorem I.1.

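For the original matrix $M = U \cdot V$, assume the distributions
of $U_{i,k}$ and $V_{k,a}$ to have support in $A_\delta = \{-1, -1 + \delta, \dots, 1 - \delta, 1\}$.
Also, the factors of the rank-r solution $\hat M$
are supported on the same discrete set. Then, the simplified
upper bound is given by

$$\tilde\delta(\epsilon, \alpha, \Delta) = \Big\{ \Delta^2 + (4r^2 - \Delta^2) \Big( 1 - \exp\Big( -\frac{\log N}{\bar\epsilon} \Big) \Big) \Big\}^{1/2} ,$$

where $N = |A_\delta|$ and $\bar\epsilon = \epsilon/((1 + \alpha) r)$. Note that
$\lim_{\epsilon \to \infty} \tilde\delta(\epsilon, \alpha, \Delta) = \Delta$, which means that we cannot get
RMSE smaller than $\Delta$.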

The maximum quantization error associated with $M_{i,a}$ is
$r(\delta - \delta^2/4)$, which happens when all the entries of $u_i$ and $v_a$
are $1 - \delta/2$ and quantized to 1. For simplicity, $\mathrm{err}(\delta) = r\delta$ is
used. Combined with Eq. (12), we have a simple analytical
upper bound on the distortion when the original matrix and
the estimate have continuous support $[-1, 1]$.




Fig. 2. The upper bound in Eq. (12) with simplified upper bound
$\tilde\delta(\epsilon, \alpha, \Delta)$, for $\alpha = 1$ and $\Delta = 0$ and a few values of the rank r.



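Proof of Theorem I.1. From the example above, we can
compute the simplified upper bound directly to bound the
distortion:

$$D(M, \hat M) \le \Big\{ 4r^2 - \big( 4r^2 - (\Delta + 2r\delta)^2 \big) \exp\Big( -\frac{\log N}{\bar\epsilon} \Big) \Big\}^{1/2} + 2r\delta$$

$$\le \Big\{ (\Delta + 2r\delta)^2 + 4r^2 \Big( 1 - \exp\Big( -\frac{\log N}{\bar\epsilon} \Big) \Big) \Big\}^{1/2} + 2r\delta$$

$$\le \Delta + 4r\delta + 2r \sqrt{ 1 - \exp\Big( -\frac{\log N}{\bar\epsilon} \Big) }$$

$$\le \Delta + 4r\delta + 2r \sqrt{ \frac{\log N}{\bar\epsilon} } .$$

Remember N is defined as the alphabet size $|A_\delta|$, where the
discrete alphabet $A_\delta = \{-1, -1 + \delta, \dots, 1 - \delta, 1\}$ is used.
Fixing $\delta = 2/(N - 1)$, we can minimize the right hand side of
the last inequality with respect to the alphabet size N. Since
the exact minimizer cannot be represented in closed form,
we use instead an approximate minimizer $N = 4\sqrt{\bar\epsilon} + 1$,
which results in

$$D(M, \hat M) \le \Delta + \frac{2r}{\sqrt{\bar\epsilon}} \Big( 1 + \sqrt{ \log(4\sqrt{\bar\epsilon} + 1) } \Big) \le \Delta + \frac{2r}{\sqrt{\bar\epsilon}} \log(10\bar\epsilon) , \tag{14}$$

where the last inequality in (14) is true for $\bar\epsilon > 1.5$. This
is practical since we are typically interested in the region
where $\log(10\bar\epsilon)/\sqrt{\bar\epsilon} < 1$.

□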
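The last simplification step can be checked numerically. The sketch below assumes the choices $\delta = 2/(N-1)$ and $N = 4\sqrt{\bar\epsilon} + 1$ described in the proof, and treats N as a real number rather than rounding it to an integer alphabet size; the function names are hypothetical.

```python
import math

def bound_before_simplification(eps_bar, r, delta_obs=0.0):
    """Delta + 4 r delta + 2 r sqrt(log N / eps_bar) with the chosen N."""
    N = 4.0 * math.sqrt(eps_bar) + 1.0       # approximate minimizer (real-valued)
    quant_step = 2.0 / (N - 1.0)             # delta = 2 / (N - 1)
    return (delta_obs + 4.0 * r * quant_step
            + 2.0 * r * math.sqrt(math.log(N) / eps_bar))

def theorem_bound(eps_bar, r, delta_obs=0.0):
    """Right-hand side of Eq. (3): Delta + 2 r log(10 eps_bar) / sqrt(eps_bar)."""
    return delta_obs + 2.0 * r * math.log(10.0 * eps_bar) / math.sqrt(eps_bar)

grid = [1.5, 2.0, 5.0, 10.0, 100.0, 1e4, 1e6]
gaps = [theorem_bound(e, r=2) - bound_before_simplification(e, r=2) for e in grid]
```

On this grid the simplified expression stays above the intermediate one, as the proof requires for $\bar\epsilon > 1.5$.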



IV. Lower Bound 

When the number of observed elements is smaller than
O(n), high distortion is inevitable. In this section we derive
a quantitative lower bound which supports this observation.

Proposition IV.1. Let $M = U \cdot V$ be a random rank-r matrix
with n rows and $n\alpha$ columns and assume the distributions of
$U_{i,k}$ and $V_{k,a}$ to have support in $[-1, 1]$, and E a random
subset of $n\epsilon$ row-column pairs. Then, with high probability,
any rank-r matrix $\hat M$ such that $|\hat M_{i,a} - M_{i,a}| = 0$ for all
$(i,a) \in E$, also satisfies

$$D(M, \hat M) \ge c\, e^{-\epsilon/2} , \tag{15}$$

where c is a strictly positive constant that only depends on
the rank r and the initial distributions $p_0$ and $q_0$.

Proof. Think of the following algorithm, which clearly has
better performance than any other that satisfies the assumptions.
Consider the bipartite graph G = (R, C, E) with
vertices corresponding to the rows and columns of M and
edges for the observed entries. For every pair of row and
column indices (i, a), $i \in R$ and $a \in C$, that is not connected
by an edge, we do the following. If the degree of i (a) is less
than r, we assume that all the neighbors of node i (a) are
known and make an MMSE estimate of $u_i^0$ ($v_a^0$). If the degree of
i (a) is greater than r - 1, we assign the correct value of
$u_i^0$ ($v_a^0$). With high probability the resulting RMSE is greater
than $\underline\delta(\epsilon, \alpha)$ as defined below:

$$\underline\delta(\epsilon, \alpha) = c \sqrt{ 1 - (1 - \xi)(1 - \zeta) } , \tag{16}$$

where $\xi = P\{\mathrm{degree}(i) < r\} = \sum_{k=0}^{r-1} \frac{\epsilon^k}{k!} e^{-\epsilon}$,
$\zeta = P\{\mathrm{degree}(a) < r\} = \sum_{k=0}^{r-1} \frac{(\epsilon/\alpha)^k}{k!} e^{-\epsilon/\alpha}$,
and $c = \min\{ E\{u_i^0 \cdot (v_a^0 - v'_a)\}, E\{(u_i^0 - u'_i) \cdot v_a^0\} \}$. Here, $u'_i$ and
$v'_a$ represent the MMSE estimates of $u_i^0$ and $v_a^0$ respectively,
assuming that r - 1 neighbors and corresponding edges are
known.

Without loss of generality, assume $\alpha \ge 1$. Then, we can
simplify the above bound to get Eq. (15). □
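The shape of the lower bound can be explored numerically. The sketch below is written under explicit assumptions: row degrees are treated as Poisson with mean $\epsilon$, column degrees as Poisson with mean $\epsilon/\alpha$, and the constant c is set aside so that only the square-root factor is evaluated. The function names are hypothetical.

```python
import math

def prob_degree_below(mean, r):
    """P{Poisson(mean) < r}: probability that a vertex has degree less than r."""
    return sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(r))

def lower_bound_shape(eps, alpha, r):
    """sqrt(1 - (1 - xi)(1 - zeta)), the factor multiplying c (assumed form)."""
    xi = prob_degree_below(eps, r)            # rows with fewer than r observations
    zeta = prob_degree_below(eps / alpha, r)  # columns with fewer than r observations
    return math.sqrt(1.0 - (1.0 - xi) * (1.0 - zeta))

checks = [
    lower_bound_shape(eps, alpha, r) >= math.exp(-eps / 2.0) - 1e-12
    for eps in (0.5, 1.0, 2.0, 5.0, 10.0)
    for alpha in (1.0, 2.0)
    for r in (1, 2, 3)
]
```

Since $1 - (1-\xi)(1-\zeta) \ge \xi \ge e^{-\epsilon}$, the factor is at least $e^{-\epsilon/2}$ under these assumptions, which is one way a bound that decays exponentially in $\epsilon$ can arise.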



Fig. 3. The upper bound $\bar\delta(\epsilon, \alpha, \Delta)$, the simplified upper bound
$\tilde\delta(\epsilon, \alpha, \Delta)$ and the lower bound $\underline\delta(\epsilon, \alpha)$ for rank r = 2, $\alpha = 1$,
$\Delta = 0$. Here the factors $U_{i,k}$, $V_{k,a}$ take values in {-1, 0, 1}.



V. Efficient matrix completion 

In the previous sections we proved that O(n) random
entries determine a random low rank matrix within an
arbitrarily small RMSE. How hard is it to find such a matrix?
In this section we present a numerical investigation using a 
low complexity stochastic local search algorithm that we call 
WalkRank. 

WalkRank is inspired by successful local search algorithms
for constraint satisfaction problems, such as WalkSAT
[8]. It is particularly suited to low-rank matrices whose factors
$U_{i,k}$, $V_{k,a}$ take values in a finite set $A_N$. The algorithm
tries to find assignments of the vectors $\{u_1, \dots, u_n\}$ and
$\{v_1, \dots, v_m\}$ that minimize the cost function

$$C(\{u_i, v_a\}) = \sum_{(i,a) \in E} I\big( |M_{i,a} - u_i \cdot v_a| > \Delta \big) , \tag{17}$$

which counts the number of observations $M_{i,a}$ that are not
described by the current assignment.

The algorithm initializes the vectors $\{u_i\}$, $\{v_a\}$ to random
iid values and then alternates between two types of moves.
The first are greedy moves, described here in the case of U
factors.

Greedy move, U factors

Sample a row index $i \in R$ uniformly;
Find $u_i^{new}$ that minimizes $C(\{u_i, v_a\})$ over $u_i$;
Set $u_i \leftarrow u_i^{new}$.

Greedy moves for V factors are defined analogously.
The second type of move potentially increases the cost
function.



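Walk move

Sample $(i, a) \in E$ s.t. $|u_i \cdot v_a - M_{i,a}| > \Delta$;
Find $u_i^{new}$, $v_a^{new}$ such that $|u_i^{new} \cdot v_a^{new} - M_{i,a}| \le \Delta$;
Set $u_i \leftarrow u_i^{new}$, and $v_a \leftarrow v_a^{new}$.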


WalkRank repeatedly executes one of these moves, choosing
a walk move with probability p, and a greedy one with
probability 1 - p. The parameter p can be optimized over,
and we found $p \approx 0.1$ to be a reasonable choice.
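The two moves can be put together in a short sketch, assuming NumPy. Several details are assumptions made here because the description above leaves them open: greedy moves pick a row or a column with equal probability, the walk move draws a satisfying pair uniformly among all pairs in the alphabet, and the best assignment seen so far is returned. The function name `walkrank` and its parameters are hypothetical, and the full cost is recomputed at every step for simplicity rather than speed.

```python
import itertools
import numpy as np

def walkrank(M, edges, n, m, r, alphabet=(-1.0, 1.0), tol=0.0,
             p_walk=0.1, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    cands = np.array(list(itertools.product(alphabet, repeat=r)))  # all of A^r
    prods = cands @ cands.T                      # u . v for every candidate pair
    U = cands[rng.integers(len(cands), size=n)]  # row vectors u_i
    V = cands[rng.integers(len(cands), size=m)]  # column vectors v_a
    rows = np.array([i for i, _ in edges])
    cols = np.array([a for _, a in edges])
    vals = np.array([M[i, a] for i, a in edges])
    by_row = [np.flatnonzero(rows == i) for i in range(n)]
    by_col = [np.flatnonzero(cols == a) for a in range(m)]

    def violated():
        pred = np.einsum("ek,ek->e", U[rows], V[cols])
        return np.abs(pred - vals) > tol + 1e-9

    best = (U.copy(), V.copy(), int(violated().sum()))
    for _ in range(steps):
        bad = np.flatnonzero(violated())
        if len(bad) < best[2]:
            best = (U.copy(), V.copy(), len(bad))
        if len(bad) == 0:
            break
        if rng.random() < p_walk:
            # Walk move: repair one violated observation, ignoring the rest.
            e = rng.choice(bad)
            ok = np.argwhere(np.abs(prods - vals[e]) <= tol + 1e-9)
            if len(ok):
                k, l = ok[rng.integers(len(ok))]
                U[rows[e]], V[cols[e]] = cands[k], cands[l]
        elif rng.random() < 0.5:
            # Greedy move on a U factor: best candidate for one row.
            i = rng.integers(n); e = by_row[i]
            cost = (np.abs(cands @ V[cols[e]].T - vals[e]) > tol + 1e-9).sum(axis=1)
            U[i] = cands[int(np.argmin(cost))]
        else:
            # Greedy move on a V factor: best candidate for one column.
            a = rng.integers(m); e = by_col[a]
            cost = (np.abs(cands @ U[rows[e]].T - vals[e]) > tol + 1e-9).sum(axis=1)
            V[a] = cands[int(np.argmin(cost))]
    final = int(violated().sum())
    return (U, V, final) if final < best[2] else best

# Hypothetical usage: rank-2 matrix with +/-1 factors, 8 revealed entries per row.
rng = np.random.default_rng(3)
n = m = 20; r = 2
M = rng.choice([-1.0, 1.0], (n, r)) @ rng.choice([-1.0, 1.0], (r, m))
edges = [(i, int(a)) for i in range(n) for a in rng.choice(m, size=8, replace=False)]
U_hat, V_hat, cost = walkrank(M, edges, n, m, r)
```

The returned `cost` is the value of the cost function (17) for the returned factors; whether it reaches zero depends on the instance and on the number of steps.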




Fig. 4. Performances of the WalkRank algorithm on random rank 2 
matrices. The bold line is a lower bound on the distortion obtained 
by the maximum likelihood algorithm. 

In Figures 4 to 6 we present the distortion achieved by
the WalkRank algorithm, averaged over 10 instances. We
used factors with entries $U_{i,k}$, $V_{k,a}$ uniformly distributed in
{+1, -1}. It is clear that the resulting distortion is essentially
independent of n over two orders of magnitude and decreases
rapidly with $\epsilon$.

We compare these numerical results with an analytical 
lower bound on the distortion achieved by a maximum 
likelihood algorithm. The latter fills each unknown position 
in M with its most likely value. While there exists no 
practical implementation of the maximum likelihood rule, 
we can provide a sharp lower bound on its performances 
using techniques explained in [9]. It appears that, for low 
values of the rank, WalkRank achieves the same distortion 
as maximum likelihood, provided it is given one or two more 
entries per column/row. 
Fig. 5. Performances of the WalkRank algorithm on random rank
3 matrices.

Fig. 6. Performances of the WalkRank algorithm on random rank
4 matrices.

Fig. 7. Typical evolution of the cost function under the WalkRank
algorithm. Here the rank is r = 3, and $\epsilon = 8$.



The complexity of one WalkRank step is independent of
the matrix size (but grows with the rank). The results in
Figures 4 to 6 were obtained with a number of steps slightly
superlinear in n. In Fig. 7 we show the evolution of the cost
function averaged over 10 instances for $n = 10^3$ to $10^5$,
r = 3 and $\epsilon = 8$. The number of steps per variable required
to reach the asymptotic value increases mildly with n. A
reasonable conjecture is that the number of steps scales like
n Poly(log n).

VI. Back to the Netflix Data 

As shown in the last section, local search algorithms 
efficiently fit low rank matrices of very large dimensions, 
using few observations. They therefore provide an efficient 
tool for checking whether a dataset is well described by the 
random low rank model. 

In Figures 8 and 9 we compare the evolution of fit and
prediction error for three matrices with $n = m = 5 \cdot 10^3$:

1. A submatrix of the Netflix dataset given by the first
$5 \cdot 10^3$ movies and customers.

2. A matrix with the same subset E of revealed entries,
each of them chosen uniformly at random in $[-1, +1]$.

3. A random rank-3 matrix (for Fig. 8) or rank-5 matrix
(for Fig. 9), with set of revealed entries as above.

Fig. 8. Evolution of the fit error (top frame) and prediction error
(lower frame) for fitting three matrices with a rank 3 model. The
curves are obtained using coordinate descent in the factors.

The fit error is defined by restricting the average in Eq. (1)
to $(i, a) \in E$. The prediction error is instead obtained by
averaging over $(i, a) \notin E$. In the case of the Netflix matrix
the latter was estimated by hiding $10^3$ entries from the
dataset, and averaging over those.
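The two error measures differ only in the set of entries averaged over. A minimal sketch, assuming NumPy and a boolean array `mask` marking the revealed entries (the function name and the toy data are hypothetical):

```python
import numpy as np

def fit_and_prediction_error(M, M_hat, mask):
    """RMSE restricted to revealed entries (fit) and to hidden entries (prediction)."""
    sq = (np.asarray(M, dtype=float) - np.asarray(M_hat, dtype=float)) ** 2
    fit = np.sqrt(sq[mask].mean())
    pred = np.sqrt(sq[~mask].mean())
    return fit, pred

# Toy check: the estimate is off by 0.1 on revealed entries and 0.3 elsewhere.
M = np.zeros((4, 5))
mask = np.zeros((4, 5), dtype=bool)
mask[:, :2] = True
M_hat = np.where(mask, 0.1, 0.3)
fit, pred = fit_and_prediction_error(M, M_hat, mask)
```

A small fit error together with a large prediction error is the signature of overfitting discussed in the comparison that follows.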

We used a coordinate descent algorithm in the factors
$\{u_i\}$, $\{v_a\}$, with regularized cost function given by Eq. (2).
In agreement with the results of previous sections, random
low rank matrices are efficiently fitted with small fitting and
prediction error. The difference with iid entries is striking:
there, the fit error decreases only slowly over time, while the
prediction error actually increases. As expected, revealed
entries do not provide any information on the hidden ones.
Netflix data lie somewhat in between: both fit and prediction
error decrease over time, albeit not as sharply as for genuine
low rank matrices.

Fig. 9. As in Figure 8 but for a rank 5 model.